Kalpesh Padia, Old Dominion University, kpadia@cs.odu.edu [PRIMARY contact]
Dr. Michele C. Weigle, Old Dominion University [Faculty Advisor], mweigle@cs.odu.edu
Video:
Video can be found here.
ANSWERS:
MC 1.1 Origin and Epidemic Spread: Identify approximately where the outbreak started on the map (ground zero location). If possible, outline the affected area. Explain how you arrived at your conclusion.
We conclude that the outbreak started in Uptown and picked up in Downtown (Figure 1), spreading first to Eastside and later infecting a large number of people in other parts of the city (Figure 2).
First, the large dataset was filtered to prune noise: a Perl script generated a new dataset containing only records hinting at illness, based on keywords related to the symptoms mentioned in the task. Only references to the bloggers’ own illness were considered. Any use of the term “pain” in the sense of emotional pain was removed manually. The filtered dataset was then input to our .NET application for visualization. Though a few people complained of flu as early as April 30th, it does not become an epidemic until May 18th (Figure 1), when it quickly starts to spread to different parts of the city.
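The keyword pass described above can be sketched roughly as follows. This is a minimal Python illustration, not the Perl script we used: the keyword list, the record layout (a dict with a "text" field) and the function names are assumptions made for the example, and the manual review of emotional uses of “pain” is not automated here.

```python
import re

# Hypothetical keyword list; the actual terms used by the Perl script are not listed here.
SYMPTOM_KEYWORDS = [
    "flu", "fever", "chills", "cough", "coughing",
    "headache", "pain", "nausea", "diarrhea",
]

# Match whole words only, so that e.g. "fluent" does not count as "flu".
SYMPTOM_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, SYMPTOM_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)


def hints_illness(text):
    """Return True if the text mentions at least one symptom keyword."""
    return SYMPTOM_RE.search(text) is not None


def filter_records(records):
    """Keep only records whose 'text' field hints at illness.

    Each record is assumed to be a dict with a 'text' key (hypothetical layout).
    """
    return [r for r in records if hints_illness(r["text"])]
```

A pass like this only narrows the data; records that mention a symptom without referring to the blogger’s own illness would still need to be checked by hand.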
Figure 1. Epidemic spread on the morning of May 18, 2011.
Figure 2. Epidemic spread on May 19, 2011.
MC 1.2 Epidemic Spread: Present a hypothesis on how the infection is being transmitted. For example, is the method of transmission person-to-person, airborne, waterborne, or something else? Identify the trends that support your hypothesis. Is the outbreak contained? Is it necessary for emergency management personnel to deploy treatment resources outside the affected area? Explain your reasoning.
We believe that this illness is spreading primarily from person to person through coughing and sneezing, similar to ordinary flu. As a result, the outbreak is not contained to a specific region, and people in other parts of the city (and potentially outside the city, if infected individuals choose to travel) risk being infected. It is therefore necessary for emergency management personnel to deploy treatment resources outside the affected area.
To analyze the spread of the epidemic, we first filtered the dataset and then fed it to our .NET application to visualize the locations of infected persons on the city map as time progresses. This gave us a basic idea of how badly the various parts of the city were infected and how fast the epidemic was spreading, and it allowed us to check whether the weather played any role in the spread.
In the beginning, the infection appears to be spreading evenly across the city (Figure 3), but later we observe that more people are infected first at Eastside, then at Villa, Westside, Smogtown and Plainsville, followed by more infections in other parts of the city. The peak of the epidemic appears to be on May 18th, when a large number of people become infected at Downtown and Eastside (Figure 4) while the wind is blowing towards the west at high speed. Correlating these observations with the weather data, we conclude that weather does not play a role in the spread of infection. Furthermore, hardly anyone fell ill around the lakes and reservoir, suggesting that the epidemic was not waterborne either.
Figure 3. Epidemic spread around May 8, 2011.
Figure 4. Epidemic spread in the afternoon of May 18th, 2011.
We then correlated this observation with the ratio of daytime population to population density in various parts of the city. While Uptown and Downtown have high ratios of 3.9 and 2.9 respectively, the ratios for Cornertown, Villa and Smogtown are closer to 1. The ratio for other parts of the city is 0.8 or less. On May 18th, the number of infected people at Downtown and Eastside grows steadily during the daytime until 6 PM, while almost no one falls ill in the western part of the city. The high daytime density ratio of these regions is probably responsible for this. After 6 PM, infection starts to pick up in the parts of the city with smaller ratios as well. This further suggests that as people return to their homes, they infect others. Over the next two days, as the large number of infected people are fatigued and stay at home, they infect more people at Villa, Smogtown and Plainsville (Figure 5). The high density ratio of Westside makes it the new hub of infection alongside Downtown, where people continue to get infected. Also, by midday on May 18th, people in Downtown and Eastside have already started to seek medical attention at the hospitals, while people in other parts of the city do the same over the next two days. These observations support the conclusion that the illness is transmitted from person to person.
Figure 5. Epidemic spread on May 20th, 2011. Note that people have already started to seek medical attention.
Further, we imported our filtered dataset into a MySQL database and calculated term frequency (TF) and term frequency-inverse document frequency (TF-IDF) for the terms appearing in the dataset. For these calculations, we treated all blogs generated in a day as a single document, which created a corpus of 21 documents, one for each day. We created a web interface using Ruby on Rails, jQuery and JavaScript to visualize the top terms in the dataset as a time cloud, in the hope of observing trends in illness over time (Figure 6).
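The per-day TF and TF-IDF calculation can be sketched as below. This is a minimal Python illustration rather than the MySQL-based computation we ran: the input layout (a list of (day, text) pairs), whitespace tokenization, and the common weighting tf × log(N / df) are assumptions made for the example, since the exact TF-IDF variant is not specified here.

```python
import math
from collections import Counter, defaultdict


def daily_documents(posts):
    """Merge all posts from the same day into one document (a list of tokens).

    `posts` is assumed to be an iterable of (day, text) pairs (hypothetical layout).
    """
    docs = defaultdict(list)
    for day, text in posts:
        docs[day].extend(text.lower().split())
    return dict(docs)


def tf_and_tfidf(docs):
    """Return {day: {term: (tf, tfidf)}} for a {day: tokens} corpus.

    tf is the term's share of the day's tokens; tfidf is tf * log(N / df),
    where N is the number of daily documents and df is the number of days
    on which the term appears.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per day

    scores = {}
    for day, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[day] = {
            term: (count / total, (count / total) * math.log(n_docs / df[term]))
            for term, count in counts.items()
        }
    return scores
```

Under this weighting, a term that appears on every day has an IDF of log(1) = 0, so its TF-IDF score is zero no matter how often it is used.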
Figure 6. TF and TF-IDF time cloud on the filtered dataset for May 18th, 2011.
While TF was helpful to some extent in observing the severity of the symptoms over time, TF-IDF was not, as it pushed the less common terms to the top. We noticed that while people complain about body pain, fever and flu on an almost daily basis, coughing becomes a prominent symptom every other day. The symptoms also get worse on May 18th, when the frequencies of these terms become much higher and the terms “worse”, “night” and “wish” appear, hinting at worsening symptoms. Over the next two days the patients also develop diarrhea. We found this interface useful for observing trends in symptoms, and it further supported our hypothesis that the illness was being spread from person to person through coughing.
We also imported the entire dataset into the database and created another web interface to visualize the movement of an infected individual across the city over time after the individual becomes infected (Figure 7). We can either select a user ID to plot that user’s movement across the city after infection, or select a date to see everyone who was infected on that day. An individual can also be located on the map on a specific date after infection. We observed that most people kept moving across the city once they were infected, thus infecting more people.
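The two lookups offered by this interface can be sketched as follows. This is a minimal Python illustration, not the database queries behind the web interface: the record layout (user_id, timestamp, latitude, longitude), the function names, and the assumption that a user counts as infected from the time of their first symptom post in the filtered dataset are all made up for the example.

```python
from datetime import date, datetime

# Hypothetical record layout: (user_id, timestamp, latitude, longitude).
Post = tuple[int, datetime, float, float]


def first_infection_times(symptom_posts: list[Post]) -> dict[int, datetime]:
    """Map each user to the time of their earliest symptom post.

    Assumption: a user is treated as infected from their first symptom post.
    """
    first: dict[int, datetime] = {}
    for user_id, ts, _lat, _lon in symptom_posts:
        if user_id not in first or ts < first[user_id]:
            first[user_id] = ts
    return first


def track_after_infection(all_posts: list[Post],
                          infected_at: dict[int, datetime],
                          user_id: int) -> list[tuple[datetime, float, float]]:
    """Return the user's (timestamp, lat, lon) points from infection onward, in time order."""
    start = infected_at[user_id]
    points = [(ts, lat, lon)
              for uid, ts, lat, lon in all_posts
              if uid == user_id and ts >= start]
    return sorted(points)


def infected_on(infected_at: dict[int, datetime], day: date) -> list[int]:
    """Return the sorted IDs of users whose first symptom post falls on `day`."""
    return sorted(uid for uid, ts in infected_at.items() if ts.date() == day)
```

Plotting the points returned by `track_after_infection` in order on the city map would give a trace comparable to the one shown for a single user.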
Figure 7. Plot showing how user 89943 moves across the city after getting infected on May 5th, 2011. Notice that the user visits various parts of the city over the next 15 days, possibly transmitting the infection.
All of the above observations support our hypothesis that the epidemic is spreading from person to person and that it is not contained to a specific region.